AITopics | data generator

Collaborating Authors

data generator

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis


Synthcity: a benchmark framework for diverse use cases of tabular synthetic data

Neural Information Processing Systems, Dec-23-2025, 20:07:31 GMT

Accessible high-quality data is the bread and butter of machine learning research, and the demand for data has exploded as larger and more advanced ML models are built across different domains. Yet, real data often contain sensitive information, are subject to various biases, and are costly to acquire, which compromise their quality and accessibility. Synthetic data have thus emerged as a complement to, sometimes even a replacement for, real data for ML training. However, the landscape of synthetic data research has been fragmented due to the diverse range of data modalities, such as tabular, time series, and images, and the wide array of use cases, including privacy preservation, fairness considerations, and data augmentation. This fragmentation poses practical challenges when comparing and selecting synthetic data generators for different problem settings. To this end, we develop Synthcity, an open-source Python library that allows researchers and practitioners to perform one-click benchmarking of synthetic data generators across data modalities and use cases. Beyond benchmarking, Synthcity serves as a centralized toolkit for accessing cutting-edge data generators. In addition, Synthcity's flexible plug-in style API makes it easy to incorporate additional data generators into the framework. Using examples of tabular data generation and data augmentation, we illustrate the general applicability of Synthcity, and the insight one can obtain.
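The plug-in idea in the abstract above can be illustrated with a small registry pattern. This is a minimal sketch under stated assumptions, not Synthcity's actual API: the names `Generator`, `register`, `GENERATORS`, and `benchmark`, the two toy generators, and the mean-error score are all hypothetical and chosen only to show how a common interface lets one loop compare every registered generator.

```python
import random
from typing import Dict, List, Type


class Generator:
    """Hypothetical minimal interface that every plug-in implements."""

    def fit(self, rows: List[float]) -> "Generator":
        raise NotImplementedError

    def generate(self, count: int) -> List[float]:
        raise NotImplementedError


# Hypothetical registry mapping a plug-in name to its class.
GENERATORS: Dict[str, Type[Generator]] = {}


def register(name: str):
    """Class decorator that adds a generator class to the registry."""
    def wrap(cls: Type[Generator]) -> Type[Generator]:
        GENERATORS[name] = cls
        return cls
    return wrap


@register("bootstrap")
class Bootstrap(Generator):
    """Resamples the observed values with replacement."""

    def fit(self, rows):
        self.rows = list(rows)
        return self

    def generate(self, count):
        return [random.choice(self.rows) for _ in range(count)]


@register("gaussian")
class Gaussian(Generator):
    """Fits a single normal distribution to the observed values."""

    def fit(self, rows):
        n = len(rows)
        self.mean = sum(rows) / n
        self.std = (sum((x - self.mean) ** 2 for x in rows) / n) ** 0.5
        return self

    def generate(self, count):
        return [random.gauss(self.mean, self.std) for _ in range(count)]


def benchmark(real: List[float], count: int = 1000, seed: int = 0) -> Dict[str, float]:
    """Fit every registered generator and score it by absolute error in the mean."""
    real_mean = sum(real) / len(real)
    scores = {}
    for name, cls in GENERATORS.items():
        random.seed(seed)  # same seed for each plug-in, for repeatability
        synthetic = cls().fit(real).generate(count)
        scores[name] = abs(sum(synthetic) / count - real_mean)
    return scores


if __name__ == "__main__":
    print(benchmark([1.0, 2.0, 3.0, 4.0]))
```

Adding a new generator in this sketch only requires a new class with the `register` decorator; the benchmarking loop itself does not change.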

benchmark framework, diverse use case, synthcity, (8 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.76)
Information Technology > Software (0.59)


Synthcity: a benchmark framework for diverse use cases of tabular synthetic data

Neural Information Processing Systems, Oct-8-2025, 02:22:29 GMT

Access to high-quality data is the lifeblood of AI.

data mining, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
South America > Brazil (0.04)

Genre: Research Report (0.68)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine > Therapeutic Area (0.68)
Health & Medicine > Health Care Technology (0.67)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)


Making Logic a First-Class Citizen in Network Data Generation with ML

Hè, Hongyu, Jin, Minhao, Apostolaki, Maria

arXiv.org Artificial Intelligence, Oct-7-2025

Generative ML models are increasingly popular in networking for tasks such as telemetry imputation, prediction, and synthetic trace generation. Despite their capabilities, they suffer from two shortcomings: (i) their output often visibly violates well-known networking rules, which undermines their trustworthiness; and (ii) they are difficult to control, frequently requiring retraining even for minor changes. To address these limitations and unlock the benefits of generative models for networking, we propose a new paradigm for integrating explicit network knowledge in the form of first-order logic rules into ML models used for networking tasks. Rules capture well-known relationships among used signals, e.g., that increased latency precedes packet loss. While the idea is conceptually straightforward, its realization is challenging: networking knowledge is rarely formalized into rules, and naively injecting them into ML models often hampers ML's effectiveness. This paper introduces NetNomos, a multi-stage framework that (1) learns rules directly from data (e.g., measurements); (2) filters them to distinguish semantically meaningful ones; and (3) enforces them through a collaborative generation between an ML model and an SMT solver.

large language model, logic & formal reasoning, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2506.23964

Country: North America > United States (0.28)

Genre: Research Report (0.40)

Industry:

Telecommunications > Networks (0.88)
Information Technology > Security & Privacy (0.68)

Technology:

Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
(2 more...)


Tree-Based Deep Learning for Ranking Symbolic Integration Algorithms

Barket, Rashid, England, Matthew, Gerhard, Jürgen

arXiv.org Artificial Intelligence, Aug-11-2025

Symbolic indefinite integration in Computer Algebra Systems such as Maple involves selecting the most effective algorithm from multiple available methods. Not all methods will succeed for a given problem, and when several do, the results, though mathematically equivalent, can differ greatly in presentation complexity. Traditionally, this choice has been made with minimal consideration of the problem instance, leading to inefficiencies. We present a machine learning (ML) approach using tree-based deep learning models within a two-stage architecture: first identifying applicable methods for a given instance, then ranking them by predicted output complexity. Furthermore, we find representing mathematical expressions as tree structures significantly improves performance over sequence-based representations, and our two-stage framework outperforms alternative ML formulations. Using a diverse dataset generated by six distinct data generators, our models achieve nearly 90% accuracy in selecting the optimal method on a 70,000 example holdout test set. On an independent out-of-distribution benchmark from Maple's internal test suite, our tree transformer model maintains strong generalisation, outperforming Maple's built-in selector and prior ML approaches. These results highlight the critical role of data representation and problem framing in ML for symbolic computation, and we expect our methodology to generalise effectively to similar optimisation problems in mathematical software.
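The two-stage structure described above (first filter to applicable methods, then rank them by predicted output complexity) can be sketched as follows. This is an illustrative sketch only: the method names, the hand-written applicability checks, and the size predictors are hypothetical stand-ins for the learned tree-based models in the paper, and expressions are plain strings rather than trees.

```python
from typing import Callable, Dict, List

# Stage 1 stand-ins: hypothetical applicability checks, one per method.
# In the paper these would be learned classifiers; here they are toy rules.
APPLICABLE: Dict[str, Callable[[str], bool]] = {
    "table_lookup": lambda expr: len(expr) <= 5,
    "substitution": lambda expr: "(" in expr,
    "series": lambda expr: True,
}

# Stage 2 stand-ins: hypothetical predictors of output complexity
# (smaller is better), again replacing learned models with toy formulas.
PREDICTED_SIZE: Dict[str, Callable[[str], float]] = {
    "table_lookup": lambda expr: 1.0,
    "substitution": lambda expr: float(len(expr)),
    "series": lambda expr: 3.0 * len(expr),
}


def rank_methods(
    expr: str,
    applicable: Dict[str, Callable[[str], bool]],
    predicted_size: Dict[str, Callable[[str], float]],
) -> List[str]:
    """Return the applicable methods, ordered from simplest predicted output."""
    # Stage 1: keep only the methods predicted to succeed on this input.
    candidates = [name for name, check in applicable.items() if check(expr)]
    # Stage 2: order the survivors by predicted output complexity.
    return sorted(candidates, key=lambda name: predicted_size[name](expr))


if __name__ == "__main__":
    print(rank_methods("sin(x)*cos(x)", APPLICABLE, PREDICTED_SIZE))
    # ['substitution', 'series']
    print(rank_methods("x^2", APPLICABLE, PREDICTED_SIZE))
    # ['table_lookup', 'series']
```

Separating the two stages means a method that cannot succeed is never ranked, so the ranking step only has to compare outputs that are expected to exist.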

artificial intelligence, expression, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2508.06383

Country:

North America (0.46)
Europe > United Kingdom > England (0.14)

Genre:

Research Report (0.64)
Overview (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


Data Swarms: Optimizable Generation of Synthetic Evaluation Data

Feng, Shangbin, Wang, Yike, Shi, Weijia, Tsvetkov, Yulia

arXiv.org Artificial Intelligence, Jun-9-2025

We propose Data Swarms, an algorithm to optimize the generation of synthetic evaluation data and advance quantitative desiderata of LLM evaluation. We first train a swarm of initial data generators using existing data, and define various evaluation objectives to reflect the desired properties of evaluation (e.g., generate more difficult problems for the evaluated models) and quantitatively evaluate data generators. We then employ particle swarm optimization to optimize the swarm of data generators, where they collaboratively search through the model parameter space to find new generators that advance these objectives. We further extend it to Adversarial Swarms, where the data generator swarm generates harder data while the test taker model swarm learns from such data, co-evolving dynamically for better data and models simultaneously. Extensive experiments demonstrate that Data Swarms outperforms eight data generation baselines across five evaluation objectives, while Adversarial Swarms produce more robust learning of synthetic data and stronger generalization. Further analysis reveals that Data Swarms successfully optimizes compositions of multiple evaluation objectives and generalizes to new off-the-shelf LLMs, unseen at optimization time.
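For readers unfamiliar with particle swarm optimization, the generic update rule can be sketched as below. This is a textbook PSO on a toy numeric objective, not the Data Swarms algorithm: in the paper the particles are data generator models moving through model parameter space and the objectives measure evaluation quality, whereas here each particle is a short list of floats and the objective is minimized. The function name `pso` and all hyperparameter values are assumptions for illustration.

```python
import random
from typing import Callable, List, Tuple


def pso(
    objective: Callable[[List[float]], float],
    dim: int,
    n_particles: int = 20,
    iters: int = 100,
    seed: int = 0,
    inertia: float = 0.7,
    c_personal: float = 1.5,
    c_global: float = 1.5,
) -> Tuple[List[float], float]:
    """Minimize `objective` with a basic particle swarm; returns (best_pos, best_val)."""
    rng = random.Random(seed)
    # Random starting positions, zero starting velocities.
    pos = [[rng.uniform(-5.0, 5.0) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    # Each particle remembers the best position it has visited so far.
    best_pos = [p[:] for p in pos]
    best_val = [objective(p) for p in pos]
    # The swarm shares the best position found by any particle.
    g = min(range(n_particles), key=lambda i: best_val[i])
    g_pos, g_val = best_pos[g][:], best_val[g]

    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity mixes momentum, pull toward the particle's own
                # best, and pull toward the swarm-wide best.
                vel[i][d] = (
                    inertia * vel[i][d]
                    + c_personal * r1 * (best_pos[i][d] - pos[i][d])
                    + c_global * r2 * (g_pos[d] - pos[i][d])
                )
                pos[i][d] += vel[i][d]
            val = objective(pos[i])
            if val < best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val < g_val:
                    g_val, g_pos = val, pos[i][:]
    return g_pos, g_val


if __name__ == "__main__":
    # Toy objective with its minimum at (1, 1).
    best, value = pso(lambda p: sum((x - 1.0) ** 2 for x in p), dim=2)
    print(best, value)
```

The "collaborative search" in the abstract corresponds to the shared swarm-wide best term: every particle is drawn toward the best point any particle has found so far.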

evolutionary algorithm, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2506.00741

Country: North America > United States (1.00)

Genre: Research Report (1.00)

Industry:

Government > Regional Government > North America Government > United States Government (1.00)
Banking & Finance (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
(2 more...)


Performative Drift Resistant Classification Using Generative Domain Adversarial Networks

Makowski, Maciej, Gower-Winter, Brandon, Krempl, Georg

arXiv.org Artificial Intelligence, Apr-1-2025

Performative Drift is a special type of Concept Drift that occurs when a model's predictions influence the future instances the model will encounter. In these settings, retraining is not always feasible. In this work, we instead focus on drift understanding as a method for creating drift-resistant classifiers. To achieve this, we introduce the Generative Domain Adversarial Network (GDAN) which combines both Domain and Generative Adversarial Networks. Using GDAN, domain-invariant representations of incoming data are created and a generative network is used to reverse the effects of performative drift. Using semi-real and synthetic data generators, we empirically evaluate GDAN's ability to provide drift-resistant classification. Initial results are promising with GDAN limiting performance degradation over several timesteps. Additionally, GDAN's generative network can be used in tandem with other models to limit their performance degradation in the presence of performative drift. Lastly, we highlight the relationship between model retraining and the unpredictability of performative drift, providing deeper insights into the challenges faced when using traditional Concept Drift mitigation strategies in the performative setting.

artificial intelligence, gdan, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2504.01135

Genre: Research Report > New Finding (0.94)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.36)


Zero-shot Meta-learning for Tabular Prediction Tasks with Adversarially Pre-trained Transformer

Wu, Yulun, Bergman, Doron L.

arXiv.org Artificial Intelligence, Feb-6-2025

We present an Adversarially Pre-trained Transformer (APT) that is able to perform zero-shot meta-learning on tabular prediction tasks without pre-training on any real-world dataset, extending the recent development of Prior-Data Fitted Networks (PFNs) and TabPFN. Specifically, APT is pre-trained with adversarial synthetic data agents, which continually shift their underlying data-generating distribution and deliberately challenge the model with different synthetic datasets. In addition, we propose a mixture block architecture that is able to handle classification tasks with an arbitrary number of classes, addressing the class size limitation -- a crucial weakness of prior deep tabular zero-shot learners. In experiments, we show that our framework matches state-of-the-art performance on small classification tasks without filtering on dataset characteristics such as number of classes and number of missing values, while maintaining an average runtime under one second. On common benchmark dataset suites in both classification and regression, we show that adversarial pre-training was able to enhance TabPFN's performance. In our analysis, we demonstrate that the adversarial synthetic data agents were able to generate a more diverse collection of data compared to the ordinary random generator in TabPFN. In addition, we demonstrate that our mixture block neural design has improved generalizability and greatly accelerated pre-training.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2502.04573

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > Santa Clara County > Stanford (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)

Genre: Research Report (1.00)

Industry: Health & Medicine > Therapeutic Area (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)


Preference Leakage: A Contamination Problem in LLM-as-a-judge

Li, Dawei, Sun, Renliang, Huang, Yue, Zhong, Ming, Jiang, Bohan, Han, Jiawei, Zhang, Xiangliang, Wang, Wei, Liu, Huan

arXiv.org Artificial Intelligence, Feb-3-2025

Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods in model development. While their combination significantly enhances the efficiency of model training and evaluation, little attention has been given to the potential contamination brought by this new model development paradigm. In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators. To study this issue, we first define three common types of relatedness between the data generator LLM and the judge LLM: being the same model, having an inheritance relationship, and belonging to the same model family. Through extensive experiments, we empirically confirm the bias of judges towards their related student models caused by preference leakage across multiple LLM baselines and benchmarks. Further analysis suggests that preference leakage is a pervasive issue that is harder to detect compared to previously identified biases in LLM-as-a-judge scenarios. All of these findings imply that preference leakage is a widespread and challenging problem in the area of LLM-as-a-judge. We release all code and data at: https://github.com/David-Li0406/Preference-Leakage.

large language model, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2502.01534

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
North America > United States > Texas > Travis County > Austin (0.04)
(5 more...)

Genre: Research Report > New Finding (0.93)

Industry: Education (0.38)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
